Heterogeneous parallel primitives (HPP) addresses
two major shortcomings in current
GPGPU programming models: it supports
full composability by defining abstractions
and increases flexibility in execution
by introducing braided parallelism.
With the growth in transistor counts in modern
hardware, heterogeneous systems are 
becoming commonplace. Core counts are 
increasing such that GPU and CPU designs 
are reaching deep into the tens of cores. For performance 
reasons, different cores in a heterogeneous platform follow 
different design choices. Based on throughput computing 
goals, GPU cores tend to support wide vectors and sub- 
stantial register files. Current designs optimize CPU cores 
for latency, dedicating logic to caches and out-of-order 
dependence control. 

Heterogeneous platforms are clearly here to stay and 
will soon be ubiquitous. The problems that inevitably arise 
for hardware developers relate to programming— in par- 
ticular, making efficient use of such platforms. 

FUNDAMENTAL PROBLEMS 

Existing programming models attempt to satisfy
some of the diverse requirements of heterogeneous
platforms. GPU programming models, especially,
have expanded over recent years to offer higher 
levels of flexibility. Both OpenCL (Open Comput- 
ing Language) 1 and CUDA (Compute Unified Device 
Architecture) 2 support heterogeneous platforms to 
some degree. To ensure that code can execute on 
various target platforms, these models employ a data- 
parallel methodology with weak communication 
guarantees. 

This relaxed approach leads to fundamental problems 
associated with 

• combining SPMD (single program, multiple data) 
programming with SIMD (single instruction, multiple 
data) execution, 

• braided parallelism, and 

• composability of operations. 

To date, most attempts to ease the programming 
burden for heterogeneous development have concentrated 
on API simplifications, such as those in CUDA over the 
graphics-oriented programming environments that 
preceded it, those in Microsoft's C++ AMP (Accelerated
Massive Parallelism) 3 design that link the benefits
of C++ type safety with GPU programming, or those in
pragma-based models such as OpenACC. These
models smooth the learning curve but do not address the
fundamental problems.
SPMD on SIMD

Heterogeneous developers follow an SPMD program- 
ming model but use a SIMD execution model, particularly 
for GPUs. GPU maker Nvidia refers to this SPMD-on-SIMD 
technique as single instruction, multiple thread (SIMT). 

Strict adherence to an SPMD model limits the sys- 
tem's flexibility. For example, OpenCL's memory model 
does not allow any communication between work groups 
without the use of atomic operations. No method is avail- 
able to guarantee that writes commit to global visibility, 
and there is very little control of memory ordering. 
CUDA offers only a partial solution with the __threadfence
operation.

SIMD execution leads to problems too. For example, 
SIMT threads map to individual SIMD lanes in a larger 
hardware thread and use execution masks to switch be- 
tween subsets when control flow diverges. There are no 
assurances of progress in the presence of dependencies 
between lanes. CUDA's limited hardware space lets pro- 
grammers make assumptions about how wide a hardware 
thread is; OpenCL offers no such opportunity. 
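
As an illustration of the execution-mask mechanism described above, the following sketch emulates one hardware thread on a CPU. It is a simplified model under stated assumptions, not vendor code: the names runDivergentBranch and kMaxLanes are hypothetical, and the "lanes" are plain vector elements processed in two masked passes rather than real SIMD registers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical upper bound on lanes per hardware thread
// (an assumption for this sketch, not an HPP or CUDA constant).
const std::size_t kMaxLanes = 64;

// Runs `if (x is even) x /= 2; else x = 3 * x + 1;` for every lane the way a
// masked SIMD unit would: both branches are issued for the whole vector, and
// the execution mask decides which lanes commit a result in each pass.
std::vector<int> runDivergentBranch(const std::vector<int>& input) {
    assert(input.size() <= kMaxLanes);
    std::vector<int> lanes = input;

    // Evaluate the branch condition once per lane to build the mask.
    std::vector<bool> mask(lanes.size());
    for (std::size_t lane = 0; lane < lanes.size(); ++lane) {
        mask[lane] = (lanes[lane] % 2 == 0);
    }

    // Pass 1: the "then" branch; lanes with a cleared mask bit stay idle.
    for (std::size_t lane = 0; lane < lanes.size(); ++lane) {
        if (mask[lane]) lanes[lane] /= 2;
    }

    // Pass 2: the "else" branch runs with the inverted mask.
    for (std::size_t lane = 0; lane < lanes.size(); ++lane) {
        if (!mask[lane]) lanes[lane] = 3 * lanes[lane] + 1;
    }
    return lanes;
}
```

Because the two passes run one after the other, a lane in the first pass cannot wait on a value produced by a lane in the second, which is one way to picture the missing progress guarantee between lanes.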

Restrictive barriers within a divergent control flow are 
not necessarily due solely to hardware limitations— they 
can also be a consequence of the programming model 
itself. For example, Titanium prohibits barriers inside any 
divergent control flow. 4 Recent work on SPMD for modern 
CPUs avoids barriers in control flow altogether by defining 
a notion of maximal convergence— a guarantee that if two 
program instances follow the same control path, they will 
execute each program statement concurrently. 5 

Braided parallelism 

Both OpenCL and CUDA push data parallelism as the 
most important abstraction for parallel computation. In 
the real world, however, there are many forms of parallel- 
ism within applications. A game engine, for example, has 
parallel AI tasks, concurrent threads for user interfaces, 
massively data-parallel particle simulations, and so on. 
This mixture of task and data parallelism within the same 
graph is called braided parallelism. 6 Such applications are 
parallel only in terms of components; in general, they will 
dynamically generate irregular work. 

There is considerable research in this area to construct 
task-graph executions on GPUs. One such approach, 
known as persistent threads (PT), builds scheduling systems
within threads and thus circumvents the hardware
scheduler. PT's benefits range from ray tracing 8 to simple
global synchronization, 9 but it is most commonly used 
simply to reduce the communication overhead that arises 
from massively data-parallel executions. 10 

While PT is not bad in itself, the need for circumvention 
demonstrates a limitation of current programming models. 
Under some circumstances, particularly as time goes on 
and limiting power consumption becomes a core requirement
of hardware schedulers, such approaches may either
lead to performance degradation or become infeasible as 
the operating system and hardware take control back from 
the developer. 

Composability of operations 

Work items in current data-parallel models that cover 
GPUs are divided into synchronizable groups that can 
share data. The only synchronization primitive exposed 
is a barrier that enforces both memory consistency and 
work-item ordering. The computation granularity requires 
frequent synchronization in complicated algorithms, yet 
the barrier operation is coarsely defined to work only 
within groups and does not generalize to global synchro- 
nization. This behavior precludes using barriers in most 
divergent control flows. 

In addition, many GPGPU programming models expose 
distinct address spaces, requiring the explicit movement 
of data in and out of these domains. 






These two problems are most evident in the presence 
of third-party libraries. When calling the library, program- 
mers must be aware of its parameters' memory spaces 
and write additional data movement code if the library 
has unexpected requirements. More importantly, there is 
little or no way to enforce how the code calls library func- 
tions and over what width. In essence, programmers must 
assume the library functions execute across an entire work 
group or work on a single item. The former case allows 
the library to perform barrier synchronization and share 
state internally, and the latter explicitly does not support 
such sharing. 

HETEROGENEOUS PARALLEL PRIMITIVES 

Heterogeneous parallel primitives (HPP) is a braided par- 
allel programming model designed to support both task 
and data parallelism as first-class concepts. To avoid de- 
fining a completely new parallel programming model and 
language, and to improve familiarity, we embedded HPP in 
C++11 as a library and device kernel language.

HPP aims to address some of the problems related to 
flexibility and composability. To enable performance 
and productivity on heterogeneous platforms, it balances 
safety, scalability, and flexibility. 

Safety. To aid productivity, HPP is intended to be safe. 
Due to performance requirements and the choice of C++ 
as the base language, the set of guarantees is limited
compared to some parallel languages such as X10. 11 
Furthermore, addressing deadlock and other programmer- 
introduced errors related to parallelism was not a design 
goal. While these are important issues, they are also dif- 
ficult to resolve and are currently areas of active research. 
Instead, we focused on using types to provide correctness 
guarantees with respect to values and usability. HPP relies 
on types when possible to provide static guarantees and 
to rule out initialization errors. 

Determinism is sometimes cited as a key safety fea- 
ture for a parallel programming language. 11 However, HPP 
explicitly defines a notion of asynchronous agents 12 and 
builds on nondeterminism as a foundational construct. 

There are two key reasons for this choice. 






First, because current GPUs and other accelerator types 
do not provide preemption, explicit asynchronous con- 
tinuations map more effectively to the architecture. In 
particular, executing kernels generate work, and rather 
than waiting for completion of the subtasks, they create 
and enqueue a continuation that specifies work to be ex- 
ecuted once the subtasks complete. 
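
The continuation pattern described above can be sketched in standard C++. This is a minimal single-threaded model, not HPP's runtime: WorkQueue, enqueueWithContinuation, and sumOfSquaresDemo are hypothetical names, and the queue stands in for a device scheduler that never preempts a running task.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical run-to-completion work queue standing in for a scheduler
// that cannot preempt a running task.
class WorkQueue {
public:
    void enqueue(std::function<void()> task) { tasks_.push_back(std::move(task)); }

    void run() {
        while (!tasks_.empty()) {
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop_front();
            task();  // may enqueue further tasks
        }
    }

private:
    std::deque<std::function<void()>> tasks_;
};

// Enqueues every subtask; the last one to finish enqueues the continuation.
// Nothing ever blocks waiting for the subtasks.
void enqueueWithContinuation(WorkQueue& queue,
                             const std::vector<std::function<void()>>& subtasks,
                             std::function<void()> continuation) {
    assert(!subtasks.empty());
    auto remaining = std::make_shared<std::size_t>(subtasks.size());
    for (const auto& subtask : subtasks) {
        queue.enqueue([&queue, remaining, subtask, continuation] {
            subtask();
            if (--*remaining == 0) queue.enqueue(continuation);
        });
    }
}

// A parent task generates four subtasks and a continuation that sums their
// partial results, then returns immediately.
int sumOfSquaresDemo() {
    WorkQueue queue;
    auto partial = std::make_shared<std::vector<int>>(4, 0);
    int total = -1;
    queue.enqueue([&queue, partial, &total] {
        std::vector<std::function<void()>> subtasks;
        for (int i = 0; i < 4; ++i) {
            subtasks.push_back([partial, i] { (*partial)[i] = (i + 1) * (i + 1); });
        }
        enqueueWithContinuation(queue, subtasks, [partial, &total] {
            total = 0;
            for (int v : *partial) total += v;
        });
    });
    queue.run();
    return total;
}
```

The parent never waits: it hands the scheduler a description of what should run next, which is why the pattern suits hardware that cannot suspend a running kernel.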

Second, while reasoning about asynchronous program- 
ming can be more difficult, it also allows the developer to 
achieve close to peak performance for many designs. In
particular, deterministic models either incur additional
runtime overhead or rely on strict static guarantees, which
remove that overhead but limit what
can be expressed algorithmically. In many algorithms,
nondeterministic behavior is acceptable— for example, 
branch-and-bound search, graph clustering, and many 
graphics and media processing applications. 

Scalability. HPP supports the development of scal- 
able applications: the addition of CPU cores, discrete and 
embedded GPU cores, and other computational resources 
leads to increased performance. 

Flexibility. HPP makes it easier for programmers to 
exploit the many forms of parallelism within scalable 
applications. 

TASK AND DATA PARALLELISM IN HPP 

Programmers can use HPP to easily introduce potential 
data and task parallelism. 

Consider the following naive function for multiplying 
two matrices: 



void matrixMul(
    int size,
    double * inputA,
    double * inputB,
    double * output)
{
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            double sum = 0;
            for (int k = 0; k < size; ++k) {
                double a = inputA[i * size + k];
                double b = inputB[k * size + j];
                sum += a * b;
            }
            // Store the dot product of row i and column j.
            output[i * size + j] = sum;
        }
    }
}

In this example, the iteration spaces of the outer two for 
loops are independent of each other and the system can 
potentially execute them in parallel. A straightforward way 
to parallelize this algorithm is to use size * size work items,
where each executes the inner loop with a corresponding 
index from the 2D iteration space. We refer to this as data- 
parallel execution. 

Using this approach, we can replace the outer two for 
loops of the matrix multiplication with a call to the func- 
tion parallelFor: 

void matrixMul(
    int size,
    Pointer<double> inputA,
    Pointer<double> inputB,
    Pointer<double> output)
{
    parallelFor(
        Range<2>(size, size),
        [size, inputA, inputB, output] (
            Index<2> index) device(hpp) {
            // Each work item computes one element of the output matrix.
            unsigned int i = index.getX();
            unsigned int j = index.getY();

            double sum = 0;
            for (unsigned int k = 0; k < size; ++k) {
                double a = inputA[i * size + k];
                double b = inputB[k * size + j];
                sum += a * b;
            }
            output[i * size + j] = sum;
        });
}
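
For readers without an HPP runtime, the data-parallel mapping can be approximated with standard C++ threads. The sketch below is an assumption-laden stand-in, not HPP's parallelFor: parallelFor2D is a hypothetical helper that stripes rows across CPU threads, and std::vector replaces HPP's Pointer type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical CPU stand-in for a 2D parallelFor: calls body(i, j) once for
// every point of a rows x cols iteration space, striping rows across threads.
void parallelFor2D(std::size_t rows, std::size_t cols,
                   const std::function<void(std::size_t, std::size_t)>& body) {
    std::size_t workers =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    workers = std::min(workers, std::max<std::size_t>(1, rows));

    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([=, &body] {
            for (std::size_t i = w; i < rows; i += workers) {
                for (std::size_t j = 0; j < cols; ++j) {
                    body(i, j);
                }
            }
        });
    }
    for (std::thread& t : pool) t.join();
}

// Each (i, j) work item computes one output element, so no two work items
// write to the same location and no synchronization is needed.
std::vector<double> matrixMul(std::size_t size,
                              const std::vector<double>& a,
                              const std::vector<double>& b) {
    assert(a.size() == size * size && b.size() == size * size);
    std::vector<double> out(size * size, 0.0);
    parallelFor2D(size, size, [&](std::size_t i, std::size_t j) {
        double sum = 0;
        for (std::size_t k = 0; k < size; ++k) {
            sum += a[i * size + k] * b[k * size + j];
        }
        out[i * size + j] = sum;
    });
    return out;
}
```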

This is similar to the data-parallel technique popularized 
by OpenMP and more recently by GPGPU programming 
models. However, there is more to data-parallel execution. 
HPP is a task-parallel runtime (TPR) that supports data 
parallelism as a first-class execution model. 

To execute efficiently on a heavily multithreaded wide- 
vector device such as a GPU, scheduling the execution as 
a small pool of tasks in the way that Intel's TBB 13 might is 
inadequate. Conversely, data-parallel execution does not 
capture many types of task-parallel execution: a network 
of communicating processes executing in a pipeline is a 
common example. 
Figure 1. The heterogeneous parallel primitives (HPP) model evolves OpenCL's device model from a single-threaded device
(left) to a set of explicitly programmable work coordinators that can launch work units on the compute cores (right). 



Similar to popular TPRs designed specifically for the 
CPU, HPP tasks can be data-parallel. The difference is that 
tasks maintain data-parallel representations much later in 
the execution process, hence they map more efficiently to 
highly data-parallel architectures. 

Rewriting the previous example using tasks and the cor- 
responding notion of a future, which represents data that 
will be present at some point and thus acts as a proxy for 
synchronizing with asynchronous tasks, is straightforward: 

void matrixMul(
    int size,
    Pointer<double> inputA,
    Pointer<double> inputB,
    Pointer<double> output)
{
    Task<void, Index<2>> matMul(
        [size, inputA, inputB, output]
        (Index<2> index) device(hpp)
        {
            unsigned int i = index.getX();
            unsigned int j = index.getY();

            double sum = 0;
            for (unsigned int k = 0; k < size; ++k) {
                double a = inputA[i * size + k];
                double b = inputB[k * size + j];
                sum += a * b;
            }
            output[i * size + j] = sum;
        });

    // Enqueue the task over the 2D range and block until it completes.
    Future<void> future =
        matMul.enqueue(Range<2>(size, size));
    future.wait();
}

HPP PROGRAMMING MODEL 

HPP integrates programming concepts from OpenCL, 
C++11, and Microsoft's Concurrency Runtime (ConcRT; 
http://msdn.microsoft.com/en-us/library/dd504870.aspx). 



In particular, HPP adopts OpenCL's execution model,
extending it with braided parallelism; C++11 as the hosting
language; and a stricter and more controllable memory
model.

Model components 

The HPP programming model has three components: 
the platform model, the execution model, and the memory 
model. 

Platform model. HPP specifies an abstract hard- 
ware model that consists of one processor coordinating 
execution (the host) and one or more processors capa- 
ble of dispatching and executing kernels (the devices). 
To support both data and task parallelism, HPP evolves 
OpenCL's device model from a single-threaded device 
to a set of explicitly programmable work coordinators 
that can launch work units on the compute cores, as 
Figure 1 shows. 

Execution model. HPP defines how to configure the 
programming environment on the host and how kernels 
execute on the device. Unlike earlier GPGPU programming 
models, it supports both data and task parallelism as first- 
class execution models. 

Coordinators are single-thread scalar programs with 
limited functionality that execute on the coordination 
schedulers shown in Figure 1. Coordinators can read and 
write globally visible memory (including atomic opera- 
tions), manage conditional flow (including iteration), and 
dispatch kernels onto compute units (CUs). 

Kernel programs execute on CUs and assume an ex- 
plicitly parallel execution, wherein the written kernel 
describes the execution of a single lane or work item. On 
dispatch, many work items can share the same kernel 
code. 

Coordinators execute concurrently with kernels and 
thus can dispatch kernels while other kernels execute. 
Programmers can organize work items into work groups
of size 1 or greater. The system executes collections of 
work items within a work group in lock-step as part of an 
mvector (machine vector), potentially using predication. An 
mvector's specific length is implementation-defined and 
exposed as a symbolic constant (MVECTOR_SIZE).

Memory model. HPP defines an abstract memory hier- 
archy that kernels use, regardless of the actual underlying 
memory architecture. Unlike earlier GPGPU models, the 
HPP memory hierarchy is closer to a more traditional 
shared memory system. In particular, it does not explic- 
itly expose scratchpad memories. HPP adopts the C++11 
memory model for work-item communication. AMD's 
Graphics Core Next (GCN) architecture fully supports the 
set of ordered atomics that C++11 requires. 14 
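
To make the role of ordered atomics concrete, the following standard C++11 sketch shows the release/acquire pattern that such a memory model permits. It runs on ordinary CPU threads and only illustrates the C++11 semantics that HPP adopts; publishAndConsume is a hypothetical name and nothing here is HPP-specific.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One thread publishes a payload and then sets a flag with release ordering;
// the other spins on the flag with acquire ordering before reading the payload.
// The release/acquire pair guarantees the consumer sees the written payload.
int publishAndConsume(int value) {
    int payload = 0;
    std::atomic<bool> ready(false);
    int observed = -1;

    std::thread producer([&] {
        payload = value;                               // plain write
        ready.store(true, std::memory_order_release);  // publish
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = payload;  // ordered after the producer's write
    });

    producer.join();
    consumer.join();
    assert(observed == value);
    return observed;
}
```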

The following example is a simple HPP application that 
atomically increments its input in parallel: 

#include <atomic>

void inc(atomic_int &input, int numOfTimes)
{
    parallelFor(
        Range<1>(numOfTimes),
        [&input] (Index<1>) device(hpp) {
            // Each work item atomically adds 1 to the shared counter.
            input.fetch_add(1);
        });
}

Tasks 

HPP provides asynchronous tasks that execute on a 
grid. The key difference between this approach and that 
embodied by OpenCL is that tasks encode the behavior of 
an asynchronous agent that can execute like a ConcRT- 
style task or as an OpenCL-style dispatch. 

A templated class represents tasks in HPP: 

template<
    typename ReturnType,
    typename IndexType>
class Task
{
public:
    typedef std::vector<ReturnType>
        ReturnDataType;

    template<typename FunctionType>
    Task(FunctionType f);

    template<
        typename T,
        typename RangeType>
    auto enqueue(
        RangeType r,
        Future<T>) -> Future<ReturnDataType>;
};

As HPP is an asynchronous programming model, 
specifying intertask dependencies is the developer's re- 
sponsibility. The Future<T> type controls dependencies by 
encapsulating an initially unknown result that will become 
available at some later point. Calling the wait method on
or assigning from a future causes the runtime system to
wait on completion of execution of the underlying task and 
subsequently makes the result value available. 

For example, the following code shows the execution 
of two tasks, whose bodies are elided for space, combin- 
ing the resulting multiple futures into a single waited-on 
future: 

Future<int> f1 = Task<int>(...).enqueue(...);
Future<float> f2 = Task<float>(...).enqueue(...);

Future<pair<int, float>> f3 = f1 && f2;
f3.wait();
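
Standard C++11 futures have no && operator, but a similar combination can be sketched with std::shared_future and a deferred std::async. The helper name whenBoth is hypothetical, and this is only an analogy to HPP's Future composition, not its implementation.

```cpp
#include <cassert>
#include <future>
#include <utility>

// Combines two futures into one future of a pair. The combined future is
// deferred: it waits on both inputs only when its own result is requested.
template <typename A, typename B>
std::future<std::pair<A, B>> whenBoth(std::future<A> a, std::future<B> b) {
    assert(a.valid() && b.valid());
    // shared_future is copyable, so it can be captured by a C++11 lambda.
    std::shared_future<A> sa = a.share();
    std::shared_future<B> sb = b.share();
    return std::async(std::launch::deferred, [sa, sb] {
        return std::make_pair(sa.get(), sb.get());
    });
}

// Two independent tasks whose results are joined into a single future.
std::pair<int, float> combineDemo() {
    std::future<int> f1 = std::async(std::launch::deferred, [] { return 6 * 7; });
    std::future<float> f2 = std::async(std::launch::deferred, [] { return 0.5f; });
    std::future<std::pair<int, float>> f3 = whenBoth(std::move(f1), std::move(f2));
    return f3.get();
}
```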

Distributed arrays 

Modern architectures' memory hierarchies are com- 
plex, either explicitly or implicitly exposing different 
levels and localities. An example is the explicitly managed 
scratchpad memory structure visible in OpenCL. Modern 
symmetric multiprocessing systems have similar proper- 
ties, such as nonuniform memory access (NUMA) locality. 
Without knowledge of cache layout, false sharing is a seri- 
ous issue for multithreaded applications. 

The partitioned global address space (PGAS) is a class of 
programming languages that assumes a single global ad- 
dress space that can be logically partitioned into regions. 
PGAS can allocate each region to a particular local pro- 
cessor. In OpenCL, scratchpad memories map parts of the 
global memory, with explicit loads and stores moving data 
in and out of these local regions. Global memory provides a 
shared and coherent view of all memory, while scratchpad 
memories provide "local" disjointed views that are inter- 
nally shared and coherent. 

In practice, devices implementing the OpenCL model 
have many more memories— for example, caches and 
on-chip global memories. Distributed arrays generalize 
this memory hierarchy into a PGAS abstraction of per- 
sistent user-managed memory regions, and they specify 
visibility— memory sharing and coherence— with respect 
to region nodes and their ancestors. An example use 
case is abstractly managing OpenCL's work-group local 
memory, as Figure 2 shows. However, more generally, the 
model provides a natural description of locality. 

HPP describes distributed arrays in terms of regions and 
segments. Regions are accessible entities that the runtime 
system can place into memory; a region defines a memory 
visibility constraint as a layer in the memory hierarchy. 
Segments are leaf memory allocations that the runtime 
system creates by distributing a region across a set of 
nodes in the execution graph. 

A region's division into segments is based on the number 
of subtasks created at the appropriate level of the hier- 
archy. Unlike global memory, distributed arrays that are 
bound to executions are always segmented. An executing 
kernel can access a bound segment from a particular work 
group, but not necessarily from others. Figure 2 visualizes
this process. 

Distributed arrays are represented as the type: 

template<
    typename T = void,
    bool Persistent = true,
    template <class Type> class AccessPattern
        = ScatterGather>
class DistArray
{

};

On creation, a distributed array is unbound, and the 
runtime system can allocate abstract regions and sub- 
regions. After being passed to a kernel, the array becomes 
bound and is matched by a corresponding kernel argument 
of the form: 

template<
    typename T = void,
    template <class Type> class AccessPattern
        = ScatterGather>
class BoundDistArray
{

    getRegion(Region<T>);

};



Within the kernel, the programmer can access a specific
region using getRegion(), returning:

template<
    typename T,
    template<typename Type> class AccessPattern
        = StructuredArrayAccess>
class Region : public AccessPattern<T>
{
    size_t getRegionSize();
};



The parameter AccessPattern defines a region's inter- 
face. For example, StructuredArrayAccess defines a Fortran 
array-style interface exposing [] , along with members to 
support array slicing and transformations. 

The following example shows how a developer might 
use distributed arrays: 

DistArray<float> darray;

Region<float> region;
region =
    darray.allocRegion(darray.getMaxRegionSize());

parallelFor(
    Range<1,1>(
        darray.getTotalSize(),
        Range<1>(region.getSize())),
    darray,
    [region] (
        Index<1> index,
        BoundDistArray<float> a) device(hpp)
    {
        // Each work item adds its global ID to its group's segment.
        a(region)[index.getLocalX()] +=
            index.getX();
    });

Figure 2. Distributed array access: unbound, bound, and
unbound again. (Panel labels: linear access to whole array;
linear access to region.)


In this example, the code allocates a single region and 
then binds it in the kernel's execution. It uses the local 
work-group ID index for each work item to access the 
region within the kernel. This highlights a key feature of 
distributed arrays: as coherence is described in terms of 
ancestors, allocating an independent region to each work 
group is safe. 
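
The per-work-group behavior just described can be emulated in plain C++. In this sketch, which makes no claim about HPP's actual data layout, SegmentedRegion and accumulateGlobalIds are hypothetical names: one region is backed by a separate segment per work group, and each work item indexes its own group's segment with its local ID.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a bound region: one segment of regionSize elements
// per work group, stored back to back.
class SegmentedRegion {
public:
    SegmentedRegion(std::size_t regionSize, std::size_t numGroups)
        : regionSize_(regionSize),
          numGroups_(numGroups),
          storage_(regionSize * numGroups, 0.0f) {}

    // Returns the segment that is visible to the given work group.
    float* segment(std::size_t group) {
        assert(group < numGroups_);
        return storage_.data() + group * regionSize_;
    }

    const std::vector<float>& storage() const { return storage_; }

private:
    std::size_t regionSize_;
    std::size_t numGroups_;
    std::vector<float> storage_;
};

// Emulates the kernel body `a(region)[localId] += globalId` for every work
// item, executed sequentially here for determinism.
std::vector<float> accumulateGlobalIds(std::size_t totalSize, std::size_t groupSize) {
    assert(groupSize > 0 && totalSize % groupSize == 0);
    SegmentedRegion region(groupSize, totalSize / groupSize);
    for (std::size_t global = 0; global < totalSize; ++global) {
        std::size_t group = global / groupSize;
        std::size_t local = global % groupSize;
        region.segment(group)[local] += static_cast<float>(global);
    }
    return region.storage();
}
```

Since no work group ever touches another group's segment, the segments could live in separate on-chip memories without any coherence traffic between them.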

Our current HPP implementation moves regions, when 
they fit, into on-chip scratchpad memories on the GPU, and 
it will perform cache prefetching on the CPU in some cases. 
A more optimal implementation might consider moving 
certain regions, depending on their location in the region 
tree, into scratchpad memories or, more generally, moving 
a family of regions whose access is known to be limited to 
a particular core. 

CHANNELS 

While HPP uses GPUs for general-purpose comput- 
ing, they are designed primarily to process graphics 
workloads, which are essentially dataflow pipelines. For 
example, the pipeline in Microsoft's Direct3D 11 tessellation
flow comprising hull shading, tessellation, and
domain shading (http://msdn.microsoft.com/en-us/library/ 
windows/desktop/ff476340) can amplify or consume work 
at each stage. The hull shader specifies tessellation factors 
for edges of a triangle such that the tessellator might, say, 
divide that triangle into many more to provide greater 
detail of an object nearer the viewer. 
Figure 3. HPP's channel data-flow process. The process minimizes resource usage through fine-grained threading by creating
consumers when there is source data and exiting producers once data is written. 



Hardware scheduling and memory buffers handle 
these workloads efficiently and are optimized to main- 
tain a high level of utilization, executing just enough 
work to keep each stage of the pipeline busy without 
starvation. Unfortunately, traditional GPU nongraphics 
computing models do not offer a mechanism to access 
this capability. 

As its target hardware is designed to manage this sort 
of pipeline, HPP exposes this feature to the programmer. 
To that end, HPP applies the communication channels 
concept to dynamic scheduling systems. Given the mas- 
sively data-parallel nature of GPU dispatches, the usual 
approach is for the scheduler to issue more work as 
resources become available. HPP maintains this approach 
such that rather than utilizing blocking reads, it creates 
a fine-grained consumer at the point of read. Various 
CPU task-oriented runtime systems such as the agents 
library that runs on top of ConcRT use a similar 
approach. 



Data flow 

Figure 3 illustrates HPP's channel data-flow process. 
The process involves a kernel, command queue, chan- 
nel, and control processor (panel 1). Enqueueing a kernel 
(panel 2) launches a set of work items (panel 3). These 
work items write into the channel (panel 4), which then 
contains data (panel 5). The control processor detects a 
launch condition for the channel (panel 6) and launches 
consumer work items (panel 7), which then consume 
the channel's contents (panel 8). The process continues 
as the next set of work items write into the channel 
(panel 9). 

Jeremy Sugerman and colleagues 15 introduced a simi- 
lar data-flow model, GRAMPS, that generalizes concepts 
from real-time graphics pipelines by exposing fixed- 
function and programmable processing stages linked via 
data queues. HPP channels are akin to GRAMPS' queues, 
but the mechanism for describing the coordination lan- 
guage and scheduling differs. 
HPP defines a subset of the channel interface as follows:

template<class T>
class Channel
{
public:
    Channel(size_t);

    template<typename F>
    void executeWith(
        Coordinator const& coord,
        Range<1> r,
        F f);

    size_t size();

    void write(const T& v);
};

The executeWith method associates a coordinator 
predicate that returns true if the runtime system should 
dispatch the corresponding consumer kernel. Additionally,
the channel write method blocks if the channel is full,
allowing consumers to reduce the amount of data stored
in the channel before continuing. HPP aims to lock a channel's
data store into an on-chip cache, so the store cannot be
overly large. The advantage is good producer/consumer
data locality.
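
The interplay of a bounded store, a blocking write, and a launch predicate can be sketched with standard C++ threads. This is a simplified model under stated assumptions rather than HPP's Channel: SketchChannel and channelDemo are hypothetical names, the predicate receives only the current size, and the consumer runs inline in the writing thread instead of being dispatched as a kernel by a control processor.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

template <typename T>
class SketchChannel {
public:
    explicit SketchChannel(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Registers the launch predicate and the consumer that it triggers.
    void executeWith(std::function<bool(std::size_t)> shouldLaunch,
                     std::function<void(const std::vector<T>&)> consumer) {
        std::lock_guard<std::mutex> lock(mutex_);
        shouldLaunch_ = std::move(shouldLaunch);
        consumer_ = std::move(consumer);
    }

    // Blocks while the channel is full; after storing the value, launches the
    // consumer if the predicate holds and then frees the space it consumed.
    void write(const T& value) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return data_.size() < capacity_; });
        data_.push_back(value);
        if (shouldLaunch_ && shouldLaunch_(data_.size())) {
            consumer_(data_);
            data_.clear();
            notFull_.notify_all();
        }
    }

private:
    std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable notFull_;
    std::vector<T> data_;
    std::function<bool(std::size_t)> shouldLaunch_;
    std::function<void(const std::vector<T>&)> consumer_;
};

// Four producers each write one value; the consumer fires once the channel
// holds four values and sums them.
int channelDemo() {
    SketchChannel<int> results(4);
    int total = 0;
    results.executeWith(
        [](std::size_t size) { return size == 4; },
        [&total](const std::vector<int>& data) {
            for (int v : data) total += v;
        });

    std::vector<std::thread> producers;
    for (int id = 0; id < 4; ++id) {
        producers.emplace_back([&results, id] { results.write(id + 1); });
    }
    for (std::thread& t : producers) t.join();
    return total;
}
```

Note that if the predicate never fires, writers to a full channel block forever, so in this model the predicate must become true at or before capacity.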

Coordinators are control programs describing when 
to trigger consumers. HPP expresses them as a restricted 
domain-specific language, embedded into C++. 

Example 

The following example, which calculates a global re- 
duction, 10 illustrates the use of distributed arrays for local 
communication and channels for global communication 
in HPP. For simplicity, input size is a multiple of MVECTOR_SIZE.
The example uses a single distributed array with two
disjoint regions and a single channel to store the results 
of each work group's reduction, with a trigger executing 
a second kernel to reduce the resulting channel data, once 
full: 

int channelSize = 32;
vector<int> input = ... ;
Channel<int> results(channelSize);
DistArray<int> darray;

Region<float> region1; // used in the 1st pass
Region<float> region2; // used in the 2nd pass

region1 =
    darray.allocRegion(MVECTOR_SIZE);
region2 =
    darray.allocRegion(channelSize);

int result;

results.executeWith(
    // Coordinator predicate: launch once every work group has written.
    [=] (Channel<int>* c) -> bool device(coord) {
        return c->size() == numWorkGroups;
    },
    Range<1, 1>(channelSize, channelSize),
    darray,
    [&result, region2] (
        Index<1, 1> index,
        BoundDistArray<float> a,
        vector<int> v) device(hpp)
    {
        int accumulator = 0;
        int id = index.getLocalX();
        Segment<float> seg = a(region2);

        seg[id] = v[id];
        seg.barrier();

        // Tree reduction within the work group.
        for (int offset = get_local_size(0) / 2;
             offset > 0;
             offset = offset / 2)
        {
            if (id < offset) {
                int other = seg[id + offset];
                int mine = seg[id];
                seg[id] = mine + other;
            }
            seg.barrier();
        }

        if (id == 0) {
            result = seg[0];
        }
    });

parallelFor(
    Range<1, 1>(input.size(), MVECTOR_SIZE),
    darray,
    [&results, input] (
        Index<1, 1> index,
        BoundDistArray<float> a) device(hpp) {
        // parallel reduce kernel body here
    });



BARRIERS 

Coordinating shared data is critical in parallel
programs that scale, and GPGPU programming models
are no exception. In current GPGPU solutions, a barrier
operation enforces memory consistency and requires all
work items to reach the same program counter (PC), which
restricts barriers to code without divergent control
flow. Using a barrier inside control flow is technically
feasible, but the programmer must guarantee that if any work
item enters the conditional statement and executes the
barrier, all work items do.

Barrier objects 

HPP addresses these limitations by introducing barriers 
as first-class values that can be used in control flow and 
across work groups. 

Barrier objects support the following interface: 

class Barrier 
{ 
public:
    Barrier(size_t count);

    void skip();
    void wait();
    void arrive();

}; 

Work items initialize a barrier with a count that rep- 
resents the number of barrier participants. The three 
operations have the following semantics: 

• Any work item that performs a skip() withdraws from
further barrier participation and thus does not count
against other participants waiting. A common use
case is early exit from a loop such that other work
items can continue synchronizing on the barrier after
one leaves.

• Any work item that performs a wait() is blocked from
continuing execution until the other participants have
also taken part. This is a common consumer action.

• Any work item that performs an arrive() participates
in the barrier but does not wait for other work items.
This is a common producer action.
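
The three operations can be modeled on a CPU with a mutex and a condition variable. The sketch below is one possible reading of the semantics listed above, not HPP's implementation: SketchBarrier and barrierDemo are hypothetical names, and the barrier is treated as reusable, with each completed round advancing a phase counter.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

class SketchBarrier {
public:
    explicit SketchBarrier(std::size_t count) : participants_(count) {}

    // Withdraw permanently; waiters no longer count this participant.
    void skip() {
        std::unique_lock<std::mutex> lock(mutex_);
        assert(participants_ > 0);
        --participants_;
        releaseIfComplete();
    }

    // Take part in the current round without blocking.
    void arrive() {
        std::unique_lock<std::mutex> lock(mutex_);
        ++arrived_;
        releaseIfComplete();
    }

    // Take part in the current round and block until it completes.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t phase = phase_;
        ++arrived_;
        releaseIfComplete();
        released_.wait(lock, [this, phase] { return phase_ != phase; });
    }

private:
    // Caller must hold mutex_. Completes the round once every remaining
    // participant has arrived.
    void releaseIfComplete() {
        if (arrived_ >= participants_) {
            arrived_ = 0;
            ++phase_;
            released_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable released_;
    std::size_t participants_;
    std::size_t arrived_ = 0;
    std::size_t phase_ = 0;
};

// Eight work items share one barrier. Items 0 to 3 wait a different number
// of times and then skip; items 4 to 7 skip immediately.
int barrierDemo() {
    SketchBarrier barrier(8);
    std::atomic<int> completedWaits(0);
    std::vector<std::thread> items;
    for (int id = 0; id < 8; ++id) {
        items.emplace_back([&barrier, &completedWaits, id] {
            if (id < 4) {
                for (int j = 0; j < id; ++j) {
                    barrier.wait();
                    ++completedWaits;
                }
            }
            barrier.skip();
        });
    }
    for (std::thread& t : items) t.join();
    return completedWaits.load();
}
```

In the demo, the participants that leave early call skip(), so the items still looping are never left waiting for a work item that has already exited.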






Barriers in control flow fall out directly by using wait()
or arrive(), combined with skip(), as the following example
shows. Work items that enter the else branch or
exit the loop call skip(), withdrawing them from the barrier.
The remaining work items can continue iterating and
communicating through scratch and wait() on the barrier.

Barrier b(8);
parallelFor(Range<1>,
    [&b, scratch] (Index<1> i) {
        scratch[i.getX()] = i.getX();
        if (i.getX() < 4) {
            for (int j = 0; j < i.getX(); ++j) {
                b.wait();
                x[i.getX()] += scratch[j+1];
            }
            b.skip();
        } else {
            b.skip();
            x[i.getX()] = 17;
        }
    });

By passing barrier objects to functions and skipping 
elsewhere, synchronizing those functions on the barrier 
without dependencies on external work items is safe: 



void someOpaqueLibraryFunction(
    const int i, Barrier &b);
Barrier b(8);
parallelFor(
    Range<1>,
    [&b, scratch] (Index<1> i) {
        scratch[i] = val;
        if (i.getX() < 4) {
            someOpaqueLibraryFunction(i, b);
        } else {
            b.skip();
            x[i.getX()] = 17;
        }
    });

While barrier objects allow for function composition,
developers must carefully control their use. In the code
above, for example, replacing the call to skip() in the
else branch with wait() is not generally valid because it
is impossible to know how many times someOpaqueLibraryFunction
will use the barrier. The solution in this case is
to use two barriers:

Barrier b(8);
Barrier b2(8);
parallelFor(Range<1>,
    [&b, &b2, scratch] (Index<1> i) {
        scratch[i] = val;
        if (i < 4) {
            someOpaqueLibraryFunction(i, b);
            b2.wait();
        } else {
            b.skip();
            b2.wait();
            x[i] = 17;
        }
    });

A potential use case for barrier objects is to support 
synchronization between dependent kernels. For example,
Shucai Xiao and Wu-chun Feng 9 describe the host
pattern 

for(...) {
    parallelFor(Range<1>(N), foo);
}

where there is an implicit synchronization following 
each invocation of parallelFor, with the intention of 
pushing the for loop onto the device. The goal is to 
reduce the cost of synchronization between the host 
and device: 

void foo(Index<1> index, ...) device(hpp)
{
  for(...) {
    foo(index, ...);
    gpu_sync();
  }
}

where gpu_sync() is an inter-work-group barrier operation.
Implementation

We used the global data share (GDS) in AMD's HD7970
GPU to implement BarrierObj, a cross-work-group variant
of HPP's barrier objects. GDS is a 64-Kbyte on-chip
global memory with barrier functionality across the entire
device. We also used the XFBarrier algorithm to implement
gpu_sync(). Figure 4 compares these implementations
with CPU Barrier, which synchronizes through separate
kernel dispatches, on a synthetic benchmark that calculates
the mean of two floats 10,000 times. 9 Note the close
correlation between the base compute number (no
synchronization) and the results of the synchronized version
using barrier objects.

A limitation of the XFBarrier algorithm is the inability 
to move the following loop onto the device: 

for(...) {
  parallelFor(Range<1>(N), foo);
  parallelFor(Range<1>(M), bar);
}

where N ≠ M. Using either channels or tasks, this pattern
can be easily expressed in HPP and is an example of how
the model goes beyond what is expressible in the general
framework of persistent threads. 7

Another critical limitation of the XFBarrier algorithm is
the inability to synchronize within control flow. Barrier
objects have no such restriction, though implementing them
on a SIMD machine for use in control flow raises issues.

Implementation is straightforward if the programmer 
guarantees to diverge on mvector subgroups of work items. 
One caveat is that communicating mvectors must run 
concurrently. The common approach to handling this in 
software is to issue just enough mvectors to coschedule; 
however, this limits barrier use cases to PT applications 
only and does not easily account for the possibility of hard- 
ware schedulers behaving unpredictably. 

The implementation of barrier objects on AMD's GCN 
hardware takes an alternative approach that allows 
unlimited mvectors. To avoid deadlocking with "busy" 
waiting mvectors, it uses GCN's sleep N to indicate to 
the hardware to deschedule the mvector, allowing other 
mvectors in the work group to run. To avoid atomic load 
contention, the implementation enforces a naive back- 
off scheme by selecting a different sleep period for each 
mvector. 
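The sleep-and-poll idea can be illustrated on a CPU as follows. This is only an analogy under stated assumptions: threads stand in for mvectors, std::this_thread::sleep_for stands in for the GCN sleep instruction, and the names SleepingBarrier and runBackoffExample, the single-use design, and the 20-microsecond step are all invented for the sketch rather than taken from the implementation described above.

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// Single-use barrier on an atomic counter. Rather than spinning, each waiter
// sleeps between polls, with a period derived from its id so that waiters do
// not all reload the counter at the same moment (a naive back-off).
class SleepingBarrier {
public:
    explicit SleepingBarrier(int participants) : remaining_(participants) {}

    void wait(int id) {
        // Announce arrival; the read-modify-write publishes earlier writes.
        remaining_.fetch_sub(1, std::memory_order_acq_rel);
        const std::chrono::microseconds period(20 * (id + 1));
        while (remaining_.load(std::memory_order_acquire) > 0) {
            std::this_thread::sleep_for(period);  // give up the core
        }
    }

private:
    std::atomic<int> remaining_;
};

// Each waiter publishes a value, passes the barrier, then checks that it can
// see every other waiter's value. Returns how many waiters saw them all.
int runBackoffExample(int n) {
    SleepingBarrier barrier(n);
    std::vector<int> published(n, 0);
    std::atomic<int> consistent{0};
    std::vector<std::thread> waiters;
    for (int id = 0; id < n; ++id) {
        waiters.emplace_back([&, id] {
            published[id] = id + 1;
            barrier.wait(id);
            int sum = 0;
            for (int v : published) sum += v;
            if (sum == n * (n + 1) / 2) consistent.fetch_add(1);
        });
    }
    for (auto& t : waiters) t.join();
    return consistent.load();
}
```

Sleeping rather than spinning lets more waiters exist than can run at once, which is the property the descheduling approach relies on; the differing periods only spread out the polling and do not affect correctness.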

Implementing barrier objects requires hardware support
for unstructured control flow, in particular, the
ability to directly branch to the instruction following a
previous wait() and reenable the vector execution mask.
For this reason, modern GPU architectures such as GCN
support sub-mvector width divergence. In the future,
we expect to see direct hardware support of this operation,
providing acceleration structures for implementing
full MIMD (multiple instruction, multiple data) behavior
on SIMD execution. For example, AMD's heterogeneous
system architecture explicitly supports barrier objects
(renamed "flow barriers"). 16

Figure 4. Microbenchmark execution time comparison.
The Compute line assumes no synchronization. The other
lines refer to various synchronization options: XFBarrier
uses the gpu_sync() implementation; CPU Barrier
synchronizes using separate kernel dispatches; and BarrierObj
uses HPP's hardware-backed global barriers.

With the success of programming models such 
as OpenCL and CUDA, heterogeneous comput- 
ing is becoming mainstream. However, current 
systems are low-level and not composable, and behavior 
is often implementation-defined even for standardized 
models. 

Heterogeneous parallel primitives is an object-oriented,
C++11-based programming model that addresses these
shortcomings on both CPUs and massively multithreaded
GPUs: it supports full composability by defining
abstractions using distributed arrays and barrier objects, and it
increases flexibility in execution by introducing braided
parallelism.

We have implemented a feature-complete version of 
HPP, including all syntactic constructs, that runs on top of 
a task-parallel runtime executing on the CPU. We continue 
to develop and improve the model, including reducing 
overhead due to channel management, and plan to make 
a public version available sometime in the future.